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Abstract 

Background: Sequence alignment has become an indispensable tool in modern molecular biology research, and 
probabilistic sequence alignment models have been shown to provide an effective framework for building 
accurate sequence alignment tools. One such example is the pair hidden Markov model (pair-HMM), which has 
been especially popular in comparative sequence analysis for several reasons, including their effectiveness in 
modeling and detecting sequence homology, model simplicity, and the existence of efficient algorithms for 
applying the model to sequence alignment problems. However, despite these advantages, pair-HMMs also have a 
number of practical limitations that may degrade their alignment performance or render them unsuitable for 
certain alignment tasks. 

Results: In this work, we propose a novel scheme for comparing and aligning biological sequences that can 
effectively address the shortcomings of the traditional pair-HIVlMs. The proposed scheme is based on a simple 
message-passing approach, where messages are exchanged between neighboring symbol pairs that may be 
potentially aligned in the optimal sequence alignment. The message-passing process yields probabilistic symbol 
alignment confidence scores, which may be used for predicting the optimal alignment that maximizes the 
expected number of correctly aligned symbol pairs. 

Conclusions: Extensive performance evaluation on protein alignment benchmark datasets shows that the 
proposed message-passing scheme clearly outperforms the traditional pair-HMM-based approach, in terms of both 
alignment accuracy and computational efficiency. Furthermore, the proposed scheme is numerically robust and 
amenable to massive parallelization. 



Background 

Sequence alignment has become an indispensable tool in 
modern molecular biology research, as it provides an 
effective and intuitive way of comparing and analyzing 
biological sequences. Given a set of biological sequences, 
the primary objective of sequence alignment is to predict 
the best overall mapping between the sequences, which 
accurately aligns the homologous regions that are 
embedded in them. This provides an effective means for 
detecting conserved sequence regions with potentially 
important functional roles. The concept of sequence 
alignment has had diverse applications in biomedical 
research [1-7], which include homology search, function 
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and structure prediction of biomolecules, phylogenetic 
analysis, and detecting sequence motifs, among others. 

Typically, sequence alignment is carried out by formu- 
lating and solving an optimization problem - either impli- 
citly or explicitly - where the goal is to maximize an 
objective function that measures the overall quality of the 
sequence alignment. For example, one simple way of align- 
ing a sequence pair would be to score each potential align- 
ment by assigning a "substitution score" to every aligned 
symbol pair and penalty scores for gaps and then find the 
optimal alignment that maximizes the overall score 
through dynamic programming [1]. In the past, various ad 
hoc scoring schemes have been proposed to obtain intui- 
tive and biologically meaningful sequence alignment 
results. As an alternative to heuristic scoring schemes, 
there have been also research efforts to develop probabilis- 
tic models for sequence alignment that can be used to 



© 2014 Yoon; licensee BiolVled Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons 
Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver {http:// 
creativecommons,org/publicdomain/zero/1.0/) applies to the data made availablein this article, unless otherwise stated. 



Yoon BMC Genomics 2014, 15(Suppl 1):S14 
http://www.biomedcentral.com/1471-2164/15/Sl/S14 



Page 2 of 8 



evaluate and compare potential alignments and to estimate 
the symbol-to-symbol alignment probabilities. 

Examples of such probabilistic schemes include the 
pair hidden Markov models (pair-HMMs) [1] and the 
partition function based scheme [8]. Given two biologi- 
cal sequences, these methods can be used to estimate 
the posterior symbol alignment probability for each 
symbol pair that may be aligned in the final sequence 
alignment. Based on the estimated probabilities, we can 
predict the optimal sequence alignment that contains 
the largest expected number of correctly aligned symbol 
pairs, rather than an alignment that maximizes an ad hoc 
score. This is typically referred to as the maximum 
expected accuracy (MEA) alignment [9-11], and as before, 
it can be also found through dynamic programming. 

Among a number of probabilistic sequence alignment 
models, pair-HMMs have been especially popular, and 
they have been widely adopted by many multiple sequence 
alignment (MSA) algorithms, including ProbCons [9] and 
PicXAA [10]. Despite the simplicity of the model, pair- 
HMMs have been shown to be very effective in modeling 
sequence homology, as reflected in the well-rounded over- 
all performance of various MSA algorithms that utilize the 
symbol alignment probabilities estimated by pair-HMMs. 
Furthermore, these probabilities can be estimated in a 
relatively efficient manner, making the pair-HMMs an 
attractive choice for various sequence alignment problems. 
However, pair-HMMs also have a number of shortcom- 
ings, which may negatively affect their alignment perfor- 
mance or make them impractical for certain alignment 
tasks. 

In this paper, we propose a novel scheme for compar- 
ing and aligning biological sequences that can effectively 
address the limitations of pair-HMMs. The proposed 
scheme computes probabilistic symbol alignment confi- 
dence scores based on a simple and computationally effi- 
cient message-passing approach. As we will demonstrate 
in this paper, this message-passing scheme has a number 
of important advantages over the traditional pair-HMMs 
and it clearly outperforms pair-HMMs in terms of both 
speed and accuracy on protein alignment benchmark 
datasets. 

Methods 

A brief overview of pair hidden Markov models 

The pair-HMM [1,12] is a generative sequence model that 
can simultaneously generate a pair of aligned symbol 
sequences. This is different from the traditional HMMs, 
which generate only a single symbol sequence at a time 
[13]. Figure 1 shows two examples of pair-HMMs that are 
widely used in biological sequence analysis. As shown in 
Figure 1, a typical pair-HMM consists of three hidden 
states Ix, ly, and M, which are used to model insertions in 
sequence x, insertions in sequence y, and matched (i.e.. 



aligned) symbols in both sequences, respectively. The pair- 
HMM generates an aligned sequence pair (x, y) by making 
transitions between the hidden states according to the spe- 
cified state transition probabilities. At state Ix, the model 
emits a symbol only to sequence x, while at ly, a symbol is 
emitted only to sequence y. On the other hand, at state M, 
the model emits a pair of aligned symbols, where one sym- 
bol is added to x and the other symbol is added to y. 
Figure 1(C) gives an example of a sequence pair (x, y) that 
is generated by a pair-HMM. In this example, the underly- 
ing hidden state sequence that gives rise to the two 
sequences x = AACCG and y = CCGTT is IxIxMMMIyly. 
This indicates that the first two symbols (i.e., AA) in x and 
the last two symbols in y (i.e., TT) are "insertions," which 
do not have any matching counterpart in the other 
sequence, while the last three symbols in x and the first 
three symbols in y (i.e., CCG in both sequences) are jointly 
generated by the pair-HMM, hence closely match each 
other. As we can see from this example, we can unam- 
biguously identify the alignment of a given sequence pair 
(x, y), once the underlying hidden state sequence yielding 
the sequence pair is known. Of course, the hidden state 
sequence is generally not known, but there exist efficient 
algorithms that can be used for its prediction. For exam- 
ple, we can use the Viterbi algorithm [14] to predict the 
optimal hidden state sequence that maximizes the obser- 
vation probability of the sequence pair (x, y). Alternatively, 
we can also predict the state sequence that maximizes the 
expected number of correct states, by first estimating the 
alignment probabilities between the symbols in x and y 
through the forward and backward procedures [13] and 
then applying the Needleman-Wunsch algorithm [15]. 
This will lead to the MEA alignment between the two 
sequences x and y. 

Limitations of pair-HMMs 

Although the hidden state sequence of a pair-HMM 
unambiguously points to a specific sequence alignment, 
this is not necessarily true the other way around. In fact, 
several different state sequences can lead to the same 
sequence alignment, hence we may not always be able to 
unambiguously determine the underlying state sequence 
for a given pairwise sequence alignment. For example, let 
us consider two sequences x = AAACGG and y = 
AAATTA. Suppose the "true" alignment aligns only the 
first three symbols (i.e., AAA) of x and y, hence the last 
three symbols in the respective sequences are regarded as 
insertions that do not have any matching counterpart in 
the other sequence. This is illustrated below, where the 
solid lines correspond to the aligned symbols: 

x: AAACGG 

III (1) 
y: AAAT T A 
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c 

I, : insertion in X (sequence 1 ) x(seq1) ; A A c c G - - 

ly : insertion in y (sequence 2) y (seq 2); --ccgtt 

M : matched (aligned) symbols in x and y States i, i. m m m i^. 

Figure 1 Pair hidden IVIarl<ov models. (A) The state transition diagram of a widely used pair-HMM. (B) An alternative pair-HMM 
implementation that does not allow transitions between the two insertion states 1^ and ly. (C) An example of a sequence pair (x, y) that is 
generated by a pair-HMM. 



For the pair-HMM shown in Figure 1(A), any hidden 
state sequence s = S1S2 ■ ■ ■ Sg such that Si = S2 = 53 = M 
and 54S5 ■ ■ ■ S9 is a permutation of Ix Ix ly ly ly would 
lead to the sequence alignment shown in (1). When 
using this pair-HMM for predicting the optimal align- 
ment of a sequence pair with the largest probability, this 
ambiguity may lead to performance degradation as these 
potential state sequences compete against each other. 
For this reason, it is generally more desirable to estimate 
the symbol alignment probabilities via the pair-HMM by 
considering all potential alignments and state sequences 
and use the estimated probabilities to find the MEA 
alignment that is expected to have the maximum num- 
ber of correctly aligned symbols [9-11]. However, the 
aforementioned ambiguity also negatively affects the 
quality of the estimated symbol alignment probabilities, 
which is especially noticeable for sequence pairs with 
low percentage identity. In some cases, the alternative 
pair-HMM shown in Figure 1(B) is used to avoid such 
ambiguity. This alternative pair-HMM blocks transitions 
between the insertion states Ix and ly, thereby prohibiting 
the model from inserting unaligned symbols to both 
sequences. For example, the alignment shown in (1) would 
not be allowed based on this alternative pair-HMM. How- 
ever, due to this restriction, the pair-HMM in Figure 1(B) 
has a relatively stronger tendency to align unrelated 
sequence regions by treating them as mutations. This may 
again negatively affect the quality of the symbol alignment 
probabilities estimated based on the pair-HMM. 

Another potential drawback of pair-HMMs is that the 
associated algorithms (i.e., the Viterbi, forward, and 



backward algorithms) can become numerically unstable 
for long sequences. Application of pair-HMMs to biolo- 
gical sequence analysis involves computing extremely 
small probabilities, which decrease exponentially with 
the sequence length. For example, based on the pair- 
HMM that was used in [9], the observation probability 
(i.e., the probability that the HMM may generate a given 
sequence pair) of a protein pair is typically in the order 
of 10' for proteins of length 80, 10' for proteins of 
length 100, and 10'^^° for proteins of length 120. As a 
result, pair-HMM algorithms are prone to underflow 
errors, unless they are carefully implemented to keep 
them numerically robust. So far, a number of schemes 
have been proposed to address this issue, such as using 
log transformations of the probabilities or normalizing 
the probabilities to keep them within a reasonable 
numerical range, and have been shown to work well for 
relatively long sequences [1]. However, log transforma- 
tions can make the forward and backward algorithms 
considerably slower, and the normalization approach 
can still lead to underflow errors as the sequences get 
longer. 

One further disadvantage of pair-HMMs is that the 
algorithms that are used with the model cannot be easily 
parallelized. Although the Viterbi, forward, and back- 
ward algorithms for pair-HMMs are relatively efficient, 
they are still computationally expensive to be used with 
very long sequences. Moreover, as the algorithms are 
not amenable to massive parallelization, this makes the 
pair-HMMs not suitable for large-scale sequence analy- 
sis tasks, such as the whole genome alignment, despite 
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their superior performance compared to other heuristic 
methods. 

A message-passing scheme for estimating symbol 
alignment confidence scores 

Here, we propose a novel method for aUgning biological 
sequences that can effectively address the aforemen- 
tioned shortcomings of pair-HMMs. The proposed 
method is based on a message-passing scheme, where 
messages are iteratively exchanged between neighboring 
symbol pairs to estimate the level of confidence for 
potential pairwise symbol alignments. The main under- 
lying motivation is to develop an "analytical" method 
that can directly estimate the symbol alignment prob- 
abilities, without specifically modeling symbol insertions 
and deletions. This stands in contrast to the pair-HMM 
approach, which is essentially based on a "generative" 
sequence model that tries to explicitly model symbol 
insertions/deletions, in addition to symbol alignments. 
As discussed before, modeling symbol insertions in pair- 
HMMs can lead to subtle issues with potentially nega- 
tive effects, and considering that our ultimate goal lies 
in finding an accurate sequence alignment through 
effective estimation of the symbol alignment probabil- 
ities, a method that can directly estimate these probabil- 
ities without explicitly modeling insertions/deletions 
would be desirable. 

Suppose X and y = yiy2 - Jm are the two 

sequences to be aligned. We define Cxy [i, j) as the sym- 
bol alignment confidence score between xt (the i-th. 
symbol in x) and yj (the /-th symbol in y). The score 
Cxy iU i) provides a quantitative measure of confidence 
as to whether Xi and yj should be aligned to each other 
or not, and we assume Cxy (/, ;') " P{xi ~ >y|x, y), where 
P {xi ~ Jy jx, y) is the posterior symbol alignment prob- 
ability between Xi and yj given the sequences x and y. 
We estimate the alignment confidence score by itera- 
tively passing messages between neighboring symbol 
pairs, where each symbol pair {xi, y^ corresponds to a 
potential symbol alignment in the true (unknown) 
sequence alignment between x and y. For example, dur- 
ing the estimation process, the symbol pair (*,, y^ will 
exchange messages with its two neighbors (Xi^\, jy^i) and 
(^;+i) and similarly, the pair (;>c,+i, Jy+i) will 

exchange messages with (x,-, y^ and {xu%, yj+2)- The mes- 
sage-passing process is illustrated in Figure 2, where the 
solid lines indicate the messages that are used to update 
the alignment confidence score Cxy(i, /) of the symbol 
pair {xi, yj). The dashed lines correspond to messages 
that are used to update the confidence scores of other 
symbol pairs. 

The pseudocode of the proposed message-passing 
algorithm is as follows: 

STEP-1 Initialize Cxy (/,;). 



STEP-2 Update the alignment confidence 
score : 

STEP-3 Normalize c^y{i, j). 

STEP-4 If c^{i, J) has converged, then ter- 
minate the algorithm. 

Otherwise, go to STEP-2. 
In STEP-1, we first initialize the alignment confidence 
score Cxy(/, /), where we can simply use random initializa- 
tion. If a preliminary sequence alignment of x and y is 
available (e.g., obtained from a simple heuristic method), 
we can also initialize the score based on this alignment 
such that Cxy(i, /) = 1 if Xi and yj are aligned, and Cxy(/, /) = 

0 otherwise. Next, in STEP-2, the alignment confidence 
score Cxy(i, /) of the symbol pair (x„ y/) is updated based on 
the scores of its two neighbors {xi^i, j/y_i) and (xi+i, yj+i). 
Note that the score is set to Cxy(i, /) = 0 if i * {1, ... , L} or 
/' € {1, ... , M}. P {xi, yj) is the joint occurrence probability 
of the symbol pair {xi, yj), which is essentially equivalent to 
the joint emission probability of an aligned symbol pair 
{Xi, yj) at the match state M of a pair-HMM. It should be 
noted that this probability P [Xi, yj) is not location-depen- 
dent and is simply determined by the symbols and yj. 
The weight parameter A G [0, 1] is used to balance the 
contribution from the neighbors and that from the joint 
probability of [xi, yj) in estimating the alignment confi- 
dence score. A large A gives more weight to the "mes- 
sages" received from the neighbors in estimating the 
scores, which tends to penalize gaps more heavily, and it 
generally leads to longer aligned regions with fewer gaps. 
On the contrary, a small A gives more weight to the joint 
symbol occurrence probability P {xi, yj) while giving less 
weight to the messages received from the neighbors, 
which tends to be more lenient to gaps. Once the symbol 
alignment confidence score Cxy(i, /) is updated for all 

1 = 1, ... , L and / = 1, ... , M, we normalize the scores to 
keep them within a proper numerical range, as shown in 
STEP-3. For example, a simple way would be to divide 
the score matrix C = [(Cxy(i, ;)] by its matrix norm to nor- 
malize the confidence scores. After normalization, the 
updated scores are compared to the scores in the last 
iteration, and the algorithm terminates if the specified 
convergence criterion has been met. Otherwise, the algo- 
rithm goes back to STEP-2 and repeats the message-pas- 
sing process. 

Results and Discussion 

Dataset and experimental set-up 

In order to evaluate the performance of the proposed 
message-passing scheme, we carried out pairwise 
sequence alignment experiments based on the BAliBASE 
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Iteration: n-1 




Iteration: n 



Figure 2 Illustration of the proposed message-passing scheme. At iteration n, the alignment confidence score c^(i,j) of tlie symbol pair (x,, 
y_,) is updated based on the messages received from its neighbors (x,_,, y^,,) and (X/+,, yy+,) and the joint occurrence probability P(x„ y) of the 
symbols x, and y. Solid lines indicate the messages that are used to update c^y(i, f), while the dashed lines correspond to messages that are used 
to update the alignment confidence scores of other symbol pairs. 



3.0 protein alignment benchmark [16]. BAliBASE is 
arguably the most widely used benchmark for multiple 
sequence alignment, and it has been utilized by most 
multiple sequence alignment algorithms for assessing 
their performance. The benchmark consists of five refer- 
ence sets, where Reference 1 consists of two subsets: VI 
and V2. Each reference set consists of multiple sequence 
alignments that satisfy specific criteria, such that differ- 
ent reference sets can be used to test the performance 
of multiple sequence alignment algorithms under differ- 
ent conditions. For example, each alignment in Refer- 
ence 2 consists of sequences that share reasonably high 
identity (> 40%) and "orphan sequences" that share little 
identity (< 20%) to other sequences in the alignment. 
Reference sets 4 and 5 are constructed such that every 
sequence has at least one other sequence in the same 
alignment whose identity exceeds 20%. Sequences in 
Reference 4 and Reference 5 may contain large N/C- 
terminal extensions or internal insertions, respectively. 
Further details of the BAliBASE 3.0 benchmark can be 
found in [16]. 

For every sequence family in BAliBASE 3.0, we per- 
formed pairwise sequence alignment for all possible 
sequence pairs in the given family. The pairwise align- 
ment was performed in the following manner. First, we 
estimated the probabilistic symbol alignment confidence 
score using the proposed message-passing scheme. 
In our experiments, we used three different values of 



A (= 0.25, 0.5, and 0.75) to investigate the effect of A on 
the overall sequence alignment performance. For the 
joint symbol occurrence probability P {Xi, yj), we used 
the joint emission probability (at state M) of the pair- 
HMM that was used in [9]. At the end of each iteration, 
we normalized the alignment confidence score by divid- 
ing the confidence score matrix C by the matrix 2- 
norm: C 9f C/||C||2- We terminated the message- 
passing process if Ei Ej Cxy(i,;)| < 0.01, 
where c^y{i, j) is the current score and c^i},)) is 
the score obtained in the previous iteration. Once 
the scores converged, based on our assumption that 
Cxy {ii i) ■=< P [xi ~ Jj |x, y), we used the confidence score 
Cxy (') /) to find the MEA alignment through dynamic 
programming. The predicted alignment was compared 
to the benchmark alignment in BAliBASE 3.0 to com- 
pute the sensitivity (SN) = jp^^^ and the positive pre- 
dictive value (PPV) = jp^^p , where TP is the number of 
correctly aligned symbol pairs, FP is the number of 
incorrectly aligned pairs, and FN is the number of symbol 
pairs that are aligned in the benchmark alignment but 
not aligned in the predicted alignment. For comparison, 
we repeated similar experiments using the pair-HMM 
with the same set of parameters as the one used in [9] . 

Performance of the proposed message-passing scheme 

Table 1 summarizes the pairwise sequence alignment 
performance of the proposed message-passing scheme 
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Table 1 Pairwise sequence alignment performance evaluated on the BAIIBASE 3.0 benchmark. 



Ref Pair-HMM Message-Passing 













X = 0.25 






X = 0.50 






X = 0.75 






SN 


PPV 


CPU 


SN 


PPV 


CPU 


SN 


PPV 


CPU 


SN 


PPV 


CPU 


RVll 


0.048 


0.106 


1.934 


0.123 


0.149 


0.769 


0.155 


0.175 


1.465 


0.198 


0.209 


3.675 


RV12 


0213 


0.414 


2.707 


0.399 


0.468 


1.146 


0.475 


0.523 


2.145 


0.569 


0.595 


5.200 


RV20 


0.276 


0.476 


2.725 


0.504 


0.568 


1.186 


0.570 


0.613 


2.251 


0.643 


0.665 


5.531 


RV30 


0.168 


0.300 


2.656 


0.324 


0.369 


1.143 


0.372 


0.402 


2.160 


0.426 


0.441 


5.432 


RV40 


0.153 


0.271 


4.084 


0.250 


0.284 


1.760 


0.300 


0.323 


3.234 


0.361 


0.373 


7.970 


RV50 


0.140 


0.254 


4.969 


0.248 


0.278 


2.102 


0.294 


0.312 


3.967 


0.348 


0.353 


9.856 



The average sensitivity (SN), positive predictive value (PPV), and CPU time (seconds) on different reference sets are shown for each sequence alignment scheme. 
All experiments were performed in Matlab on a MacPro workstation with 2 x 2.8 GHz Quad-Core Intel Xeon processors and 32GB memory. 



and the traditional pair-HMM approach. Each row 
shows the evaluation results on each of the six reference 
sets (i.e., RVll, RV12, RV20, RV30, RV40, RV50) in 
BAliBASE 3.0. For each reference set, we estimated the 
average SN, PPV, and CPU time (for estimating the 
alignment scores/probabilities) of different alignment 
schemes based on all possible pairwise sequence align- 
ments: 943 alignments for the reference set RVll, 2,335 
alignments for RV12, 50,062 alignments for RV20, 
76,370 alignments for RV30, 23,445 alignments for 
RV40, and 7,538 alignments for RV50. All experiments 
were performed using Matlab on a MacPro workstation 
with two 2.8 GHz Quad-Core Intel Xeon processors and 
32GB memory. 

From Table 1, we can clearly see that the proposed 
message-passing scheme significantly outperforms the 
pair-HMM approach in terms of SN and PPV, for all 
three values of A. For example, the message-passing 
scheme achieved up to 0.23 higher SN and 0.09 higher 
PPV for A = 0.25, and up to 0.37 higher SN and 0.19 
higher PPV for A = 0.75. Our experiments showed that 
a larger A tends to yield more accurate alignments, 
while a smaller A tends to make the algorithm converge 
faster, hence computationally more efficient. For exam- 
ple, when the weight parameter was set to A = 0.25, the 
message-passing scheme was around 2.3 ~ 2.5 times fas- 
ter than the pair-HMM, while still yielding much more 
accurate alignments. 

The results in Table 1 demonstrate that, on average, the 
proposed message-passing scheme considerably improves 
the quality of sequence alignment over the traditional pair- 
HMM approach. In order to see whether the proposed 
scheme also leads to a consistent improvement for most 
sequence pairs, we calculated the difference between SN^p 
(the sensitivity of the message-passing scheme) and 
S^HMM (the sensitivity of the pair-HMM-based approach) 
for every pairwise sequence alignment that we have per- 
formed in our experiments. Similarly, we calculated the dif- 
ference between PPVmp (the PPV of the message-passing 
scheme) and PPVhmm (the sensitivity of the pair-HMM 



approach) for all sequence pairs in BAliBASE 3.0. Figure 3 
shows the distributions of SNmp - SNhmm and PPVmp - 
PPVHMM for all sequence pairs. To avoid any bias from 
unsuccessful alignments, sequence pairs for which neither 
method yielded an alignment with at least one correct sym- 
bol alignment were excluded. The plots in the left column 
of Figure 3 show the distributions of SNmp - SNhmm< 
and those in the right column show the distributions of 
PPVmp - PPVhmm- The results obtained from the same 
reference set are shown in the same row, where the first 
row shows the results on RVll and the last row shows the 
results on RV50. As we can see in Figure 3, every single 
distribution shown in the figure has a much larger prob- 
ability mass in the right-half plane, which clearly demon- 
strates that the proposed message-passing scheme 
consistently outperforms the pair-HMM-based approach 
for most (though not all) sequence pairs. In many cases, 
the improvements in SN and PPV were quite significant 
(0.4 ~ 0.8), which shows that the proposed scheme can 
often find an accurate sequence alignment even when the 
pair-HMM has difficulty aligning the sequences. 

Conclusions 

In this paper, we proposed a novel method for sequence 
alignment based on an efficient message-passing approach. 
Given two biological sequences, the proposed method esti- 
mates the symbol alignment confidence scores for all pos- 
sible symbol pairs. These scores are iteratively computed 
by exchanging messages between neighboring symbol 
pairs, where empirical evidence shows that these scores 
quickly converge within several iterations. The proposed 
message-passing scheme effectively addresses a number of 
limitations of the traditional pair-HMM-based approach, 
and extensive performance assessment based on BAliBASE 
3.0 shows that the proposed scheme consistently outper- 
forms the pair-HMM approach, both in terms of align- 
ment accuracy and computational efficiency. Considering 
that pair-HMMs have been widely adopted by many mod- 
ern multiple sequence alignment algorithms [9-11], the 
proposed scheme has potentials to further improve the 
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Figure 3 Performance comparison between tfie proposed message-passing scheme and the traditional pair-l-IIVIIVI approach. The plots 
in the left column show the distributions of the sensitivity difference SN^p SNhmm between the message-passing scheme and the pair-HMM- 
based approach. In the right column, the distributions of the difference between the positive predictive values PPV^p - PPVhmm of the two 
schemes are shown. Each row shows the evaluation results obtained from each of the six reference sets in BAIiBASE 3.0. 
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current state-of-the-art. Furthermore, the proposed 
scheme is numerically stable even for extremely long 
sequences. Unlike the pair-HMM approach, there is no 
global measure or quantity (such as the observation prob- 
ability P (x, y) of the entire sequence pair) to be estimated, 
and the exchanged messages (i.e., symbol alignment confi- 
dence scores) are normalized after each iteration, which 
ensures that they lie within a reasonable numerical range. 
Finally, the simple iterative estimation process - in which 
the neighboring symbol pairs only exchange "local" mes- 
sages - makes the proposed message passing scheme 
amenable to massive parallelization through the utilization 
of modern GPU (graphics processing unit) architecture. 
These characteristics open up the possibility of applying 
the proposed message-passing scheme to accurate prob- 
abilistic alignment of genome-scale sequences, which has 
not been possible using traditional pair-HMMs. 

Finally, it is worth noting that the formula that is used 
to update Cxy(i, /) in the proposed message-passing algo- 
rithm bears conceptual similarity to the eigenvalue 
equation used by the network alignment algorithm 
called IsoRank [17] for estimating the functional similar- 
ity between proteins across different protein-protein 
interaction (PPI) networks. As demonstrated in [18,19], 
techniques that were originally developed for sequence 
alignment may also have potentials to improve network 
alignment methods. Conversely, techniques used in net- 
work alignment may also lead to better sequence align- 
ment methods. For example, the scoring scheme used 
by IsoRank can be viewed as a random walk [20], and it 
was shown that the use of a different random walk 
scheme can lead to more accurate network alignment 
results [19]. Similarly, it may be possible to modify the 
update formula for Cxy(/, /) to further improve the per- 
formance of the proposed message-passing scheme, and 
we are currently in the process of investigating several 
different implementations. 
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